Abstract: In this work, conditional entropy is used to quantify
the information loss induced by passing a continuous random 
variable through a memoryless nonlinear input-output system. 
We derive an expression for the information loss depending on 
the input density and the nonlinearity and show that the result is 
strongly related to the non-injectivity of the considered system. 
Tight upper bounds are presented, which can be evaluated with 
less difficulty than a direct evaluation of the information loss, 
which involves the logarithm of a sum. Application of our results 
is illustrated on a set of examples. 

I. Introduction 

Information processing, in the sense of changing the information
of or retrieving information from a signal, can only
be accomplished by nonlinear systems, while causal, stable,
linear systems do not affect a signal's entropy rate [1], [2,
pp. 663]. As a consequence, information-theoretic
measures in system analysis have in the past been used almost
exclusively for highly nonlinear, chaotic systems, mainly motivated
by the works of Kolmogorov [3], [4] and Sinai [5]. In contrast,
linear systems and relatively simple nonlinear systems
(e.g., containing static nonlinearities) usually lack information- 
theoretic descriptions and are often characterized by second- 
order statistics or energetic measures (e.g., transfer function, 
power spectrum, signal-to-distortion ratio, mean square error 
between input and output, correlation functions, etc.). 

In this work, we characterize the amount of information 
lost by passing a signal through a static nonlinearity. These 
systems, although simple, are by no means irrelevant in 
technical applications: One of the major components of the 
energy detector, a low-complexity receiver architecture for 
wireless communications, is a square-law device. Rectifiers,
omnipresent in electronic systems, are another example of
static nonlinearities, which further constitute the nonlinear
components in Wiener and Hammerstein systems. This work
thus acts as a first step towards the goal of a comprehensive 
information-theoretic framework for more general nonlinear 
systems, providing an alternative to the prevailing energetic 
descriptions. While an analysis of information rates will be 
left for future work, this paper is concerned with zeroth-order 
entropies only. 

Information loss can most generally be expressed as the
difference of mutual informations,

$$ I(\hat{X}; X) - I(\hat{X}; Y) \tag{1} $$

where the random variables (RVs) $X$ and $Y$ are two descriptions
of another RV $\hat{X}$. In words, the difference in (1) is the
information lost by changing the description from $X$ to $Y$ (cf.
Fig. 1). This kind of information loss is of particular interest
for learning/coding/clustering (e.g., word clustering [6]) and
triggered the development of optimal representation techniques [7].
Generally, changing the description from $X$ to $Y$
does not necessarily imply that the information loss is non-negative.
In the special case that $Y$ is a function of $X$, which is the
case we are concerned with, the data processing inequality
states that information can only be lost [8, pp. 35]. In other
words, the difference in (1) is non-negative.

In case $\hat{X}$ is identical to the RV $X$ itself, the information
loss simplifies to (cf. proof of Theorem 1)

$$ H(X|Y) \tag{2} $$

i.e., to the conditional entropy of $X$ given the description
$Y$. This equivocation, as Shannon termed it in his seminal
paper [1], was originally used to describe the information
loss for stochastic relations between the RVs $X$ and $Y$. In
contrast, we are concerned with deterministic functions
$Y = g(X)$.

To our knowledge, little work has been done in this regard.
Some results are available for the capacity of nonlinear channels
[9], [10], and recently the capacity of a noisy (possibly
nonlinear and non-injective) function was analyzed [11], [12].
Considering deterministic systems, we found that Pippenger
used equivocation to characterize the information loss induced
by multiplying two integer numbers [13], while the coarse
observation of discrete stochastic processes is analyzed in [14].
To the best of our knowledge, an analysis of how much information
is lost by passing a continuous RV through a static nonlinearity
cannot be found in the literature.

Aside from providing information-theoretic descriptions for
the nonlinear systems mentioned above, our results also apply
to different fields of signal processing and communication
theory. To be specific, the information loss may prove useful
to compute error bounds for the reconstruction of nonlinearly
distorted signals [8, pp. 38] and in capacity considerations for
nonlinear channels. To give another example, according to [11]
the capacity of a noisy function $G(\cdot)$ (a noisy implementation
of the deterministic function $g(\cdot)$) is given as the maximum
of

$$ H(X|Y) + I(G(X); Y) \tag{3} $$

over all (discrete) distributions of $X$. In this work, we give an
expression for the first part of this equation, assuming that $X$
is a continuous RV.

Fig. 1. Equivalent model for computing the information loss of an
input-output system with static nonlinearity $g(\cdot)$. $Q(\cdot)$ is a quantizer with
quantization step size $\Delta_{\hat{X}}$. Note that $\hat{X}$ can be modeled as the sum of $X$ and
an input-dependent finite-support noise term $N$ as in [15].

After introducing the problem statement in Section II, an
expression for the information loss is derived and related to
the non-injectivity of the system in Section III, while bounds
on the information loss are presented in Section IV. Section V
illustrates the theoretical results with the help of examples.

This is an extended version of a paper submitted to an IEEE
conference [15].

II. Problem Statement 

We focus our attention on a class of systems whose input-output
behavior can be described by a piecewise strictly
monotone function. While this excludes functions which are
constant on some proper interval (e.g., limiters or quantizers,
for which it can be shown that the information loss becomes
infinite), many well-behaved functions can be interpreted in
the light of the forthcoming Definition. Take, e.g., the function
$g(x) = \cos(x)$ on $\mathcal{X} = [0, L\pi)$. While the function is
clearly not monotone on $\mathcal{X}$, it is strictly monotone on all
$\mathcal{X}_i = [(i-1)\pi, i\pi)$, $i = 1, \dots, L$. As it turns out, piecewise
strict monotonicity does not rule out functions whose derivative
is zero on a finite set. In addition, neither continuity
nor differentiability is required by Definition 1,
but only piecewise continuity and piecewise differentiability.

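Definition 1. Let $g\colon \mathcal{X} \to \mathcal{Y}$, $\mathcal{X}, \mathcal{Y} \subseteq \mathbb{R}$, be a bounded, surjective,
Borel measurable function which is piecewise strictly
monotone on $L$ subdomains $\mathcal{X}_i$:

$$ g(x) = \begin{cases} g_1(x), & \text{if } x \in \mathcal{X}_1 \\ g_2(x), & \text{if } x \in \mathcal{X}_2 \\ \quad\vdots \\ g_L(x), & \text{if } x \in \mathcal{X}_L \end{cases} \tag{4} $$

where $g_i\colon \mathcal{X}_i \to \mathcal{Y}_i$ are bijective. We assume that the subdomains
are an ordered set of disjoint, proper intervals with
$\bigcup_{i=1}^{L} \mathcal{X}_i = \mathcal{X}$ and $x_i < x_j$ for all $x_i \in \mathcal{X}_i$, $x_j \in \mathcal{X}_j$ whenever
$i < j$. We further require all $g_i(\cdot)$ to be differentiable on the
interval enclosure of $\mathcal{X}_i$.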

Note that $\mathcal{X}$ does not need to be an interval itself. Strict
monotonicity implies that the function is invertible on each
interval $\mathcal{X}_i$, i.e., there exists an inverse function $g_i^{-1}\colon \mathcal{Y}_i \to \mathcal{X}_i$,
where $\mathcal{Y}_i$ is the image of $\mathcal{X}_i$. However, the function $g(\cdot)$
need not be invertible on $\mathcal{X}$, i.e., it can be non-injective.
Equivalently, the images of the intervals, $\mathcal{Y}_i$, unite to $\mathcal{Y}$,
but need not be disjoint. Let $g(\cdot)$ describe the input-output
behavior of the system under consideration (see Fig. 1).



As an input to this system consider a sequence of in- 
dependent samples, identically distributed with continuous 
cumulative distribution function (CDF) Fx (x) and probability 
density function (PDF) fx(x). Without loss of generality, let 
the support of this RV be X, i.e., fx(x) is positive on X and 
zero elsewhere. 

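As an immediate consequence of this system model, the
conditional PDF of the output $Y$ given the input $X$ can be
written as [16]

$$ f_{Y|X}(x, y) = \delta(y - g(x)) = \sum_{i \in \mathbb{I}(y)} \frac{\delta(x - x_i)}{|g'(x_i)|} \tag{5} $$

where $\delta(\cdot)$ is Dirac's delta distribution, $\mathbb{I}(y) = \{i : y \in \mathcal{Y}_i\}$,
and $x_i = g_i^{-1}(y)$ for all $i \in \mathbb{I}(y)$. In other words, $\{x_i\}$ is the
preimage of $y$, or the set of roots satisfying $y = g(x)$. The
marginal PDF of $Y$ is thus given as [2, pp. 130], [16]

$$ f_Y(y) = \sum_{i \in \mathbb{I}(y)} \frac{f_X(x_i)}{|g'(x_i)|}. \tag{6} $$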
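
As a brief illustration of how the marginal PDF of the output follows from the roots of $y = g(x)$, consider the following worked case. The setting (a uniform input on $[-1, 1]$ and $g(x) = x^2$) is an assumption chosen here for simplicity and is not one of the examples of Section V.

```latex
% Assumed toy setting: X uniform on [-1, 1], g(x) = x^2.
% Roots of y = g(x) for 0 < y <= 1:  x_1 = -\sqrt{y},  x_2 = \sqrt{y}.
% At both roots: |g'(x_i)| = 2\sqrt{y} and f_X(x_i) = 1/2.
f_Y(y) = \sum_{i \in \mathbb{I}(y)} \frac{f_X(x_i)}{|g'(x_i)|}
       = \frac{1/2}{2\sqrt{y}} + \frac{1/2}{2\sqrt{y}}
       = \frac{1}{2\sqrt{y}}, \qquad 0 < y \le 1,
\qquad\text{and}\qquad
\int_0^1 \frac{1}{2\sqrt{y}} \, dy = 1 .
```

Both roots contribute equally in this case, and the resulting density integrates to one, as it must.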

III. Information Loss of Static Nonlinearities 

In what follows we quantify the information loss induced 
by g(-), and we show that this information loss is identical 
to the remaining uncertainty from which interval Xi the 
input x originated after observing the output y. The main 
contribution of this work is thus concentrated in the following 
two Theorems. 

Lemma 1. Let $g\colon \mathcal{X} \to \mathcal{Y}$ and $f\colon \mathcal{Y} \to \mathcal{Z}$ be measurable
functions. Further, let $X$ be a continuous RV on $\mathcal{X}$ and $Y = g(X)$. Then,

$$ \mathrm{E}\{f(Y)\} = \int_{\mathcal{Y}} f(y) f_Y(y)\, dy = \int_{\mathcal{X}} f(g(x)) f_X(x)\, dx. \tag{7} $$

Proof: The proof is based on the fact that for a measurable
$g(\cdot)$ [2, pp. 142, Theorem 5-1]

$$ \mathrm{E}\{g(X)\} = \int_{\mathcal{X}} g(x) f_X(x)\, dx. \tag{8} $$

Lemma 1 follows from (8) since for measurable $g(\cdot)$ and $f(\cdot)$
the composition $(f \circ g)(\cdot) = f(g(\cdot))$ is also measurable. ■

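Theorem 1. The information loss induced by a function $g(\cdot)$
satisfying Definition 1 is given as

$$ H(X|Y) = \int_{\mathcal{X}} f_X(x) \log\left( \frac{\sum_{i \in \mathbb{I}(g(x))} \frac{f_X(x_i)}{|g'(x_i)|}}{\frac{f_X(x)}{|g'(x)|}} \right) dx. \tag{9} $$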

Proof: Using identities from [8] and the model in Fig. 1,
the conditional entropy $H(X|Y)$ can be calculated as

$$ \begin{aligned} H(X|Y) &= \lim_{\hat{X} \to X} \left( H(\hat{X}|Y) - H(\hat{X}|X) \right) \\ &= \lim_{\hat{X} \to X} \left( H(\hat{X}) - H(\hat{X}|X) - H(\hat{X}) + H(\hat{X}|Y) \right) \\ &= \lim_{\hat{X} \to X} \left( I(\hat{X};X) - I(\hat{X};Y) \right) \end{aligned} \tag{10} $$

where $\hat{X}$ is a discrete RV converging surely to $X$. This
auxiliary RV is necessary to ensure that the (discrete) entropies
we use are well-defined. Here, motivated by the data processing
inequality [8, pp. 34], we have related the conditional
entropy to a difference of mutual informations, which we
have introduced as the most general notion of information loss
in Section I. In addition, the mutual information has
the benefit that it is defined for general joint distributions [8,
pp. 252], which eliminates the requirement for a discrete $\hat{X}$.

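For the mutual information between $\hat{X}$ and $X$ we can write
with [8, pp. 251]

$$ I(\hat{X};X) = \int_{\mathcal{X}} \int_{\mathcal{X}} f_{\hat{X}X}(\hat{x}, x) \log\left( \frac{f_{X|\hat{X}}(\hat{x}, x)}{f_X(x)} \right) dx\, d\hat{x}. \tag{11} $$

Similarly, with Lemma 1 (the logarithm and all PDFs are
measurable) we get for $I(\hat{X};Y)$

$$ I(\hat{X};Y) = \int_{\mathcal{X}} \int_{\mathcal{X}} f_{\hat{X}X}(\hat{x}, x) \log\left( \frac{f_{Y|\hat{X}}(\hat{x}, g(x))}{f_Y(g(x))} \right) dx\, d\hat{x}. \tag{12} $$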

After subtracting these expressions according to (10) we
can exchange limit and integration (see Appendix). In the
limit the conditional PDFs assume $f_{X|\hat{X}}(\hat{x}, x) = \delta(x - \hat{x})$
and $f_{Y|\hat{X}}(\hat{x}, y) = f_{Y|X}(\hat{x}, y)$ as given in (5); using these we
obtain (13), which is displayed separately further below. Since the integral
over $\hat{x}$ is zero for $\hat{x} \neq x$ due to $\delta(\hat{x} - x)$, only the term
satisfying $x_k = x$ remains from the sum over Dirac's deltas
in the denominator; this term cancels with the delta in the
numerator. Integrating over $\hat{x}$ and substituting (6) for $f_Y(\cdot)$
finally yields

$$ H(X|Y) = \int_{\mathcal{X}} f_X(x) \log\left( \frac{\sum_{i \in \mathbb{I}(g(x))} \frac{f_X(x_i)}{|g'(x_i)|}}{\frac{f_X(x)}{|g'(x)|}} \right) dx \tag{14} $$

and completes the proof. ■



Note that for $\hat{X} \to X$ both $I(\hat{X};X)$ and $I(\hat{X};Y)$ diverge
to infinity, but their difference does not necessarily do so. Further,
if for all $y = g(x)$ the preimage is a singleton ($|\mathbb{I}(g(x))| = 1$
for all $x \in \mathcal{X}$), $g(\cdot)$ is injective (thus bijective by Definition 1)
and the information loss is $H(X|Y) = 0$.
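
The expression of Theorem 1 lends itself to direct numerical integration. The following is a minimal sketch, not the authors' implementation: it applies a midpoint rule to the integral in (9). The function name `information_loss_bits`, the `preimage` helper, and the toy setting (uniform input on $[-1, 1]$, $g(x) = x^2$) are assumptions introduced here for illustration.

```python
import math

def information_loss_bits(f_x, abs_dg, preimage, lo, hi, n=10_000):
    """Midpoint-rule estimate of the integral in (9), in bits.

    f_x      -- input PDF f_X
    abs_dg   -- magnitude of the derivative, |g'(x)|
    preimage -- hypothetical helper returning all roots x_k with g(x_k) = g(x)
    """
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h              # midpoint of the k-th cell
        fx = f_x(x)
        if fx == 0.0:
            continue                        # outside the support of X
        numerator = sum(f_x(xk) / abs_dg(xk) for xk in preimage(x))
        denominator = fx / abs_dg(x)
        total += fx * math.log2(numerator / denominator) * h
    return total

# Assumed toy setting: X uniform on [-1, 1], g(x) = x**2.
# With an even number of cells no midpoint falls on x = 0, where g'(x) = 0.
uniform_pdf = lambda x: 0.5 if -1.0 <= x <= 1.0 else 0.0
loss = information_loss_bits(
    f_x=uniform_pdf,
    abs_dg=lambda x: 2.0 * abs(x),
    preimage=lambda x: (x, -x),
    lo=-1.0,
    hi=1.0,
)
print(f"H(X|Y) is approximately {loss:.6f} bit")
```

In this toy setting both roots carry the same weight $f_X(x_k)/|g'(x_k)|$, so the argument of the logarithm equals 2 everywhere and the estimate comes out at one bit.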

In Theorem 1 we provided an expression to calculate the
information loss induced by a static nonlinearity. The following
Theorem is of a different nature: It gives an explanation
of why information is lost at all. Considering non-injective
functions $g(\cdot)$ satisfying Definition 1, multiple input values
may lead to the same output, i.e., the preimage of $y$ may contain
multiple elements. Given the output, the input is uncertain only
w.r.t. which of these elements has been fed into the system
under consideration. Due to the piecewise strict monotonicity
of $g(\cdot)$ each subdomain contains at most one element of the
preimage of $y$. Therefore, the information loss is identical to
the uncertainty about the interval $\mathcal{X}_i$ from which the input
$x$ originated given the output value $y$. Before making this
statement precise in Theorem 2, let us introduce the following
Definition:

Definition 2. Let $W$ be a discrete RV with $|\mathcal{W}| = L$ mass
points which is defined as

$$ W = w_i \quad \text{if } x \in \mathcal{X}_i \tag{15} $$

for all $i = 1, \dots, L$.

In other words, $W$ is a discrete RV which depends on the
interval $\mathcal{X}_i$ of $x$, and not on its actual value. As an immediate
consequence of this Definition we obtain

$$ p(w_i) = \Pr(W = w_i) = \int_{\mathcal{X}_i} f_X(x)\, dx \tag{16} $$

i.e., the probability mass contained in the $i$-th interval.

In accordance with the model in Fig. 1 and the reasoning in
the Appendix, one can think of $W$ as $\hat{X}$ when the quantization
bins are identical to $\mathcal{X}_i$. While in Theorem 1 we required $\hat{X}$
to converge to $X$ surely, the next Theorem shows that
such a convergence is not necessary as long as the quantization
bins are chosen appropriately. This fact will then establish the
link between the non-injectivity, piecewise strict monotonicity
on intervals, and information loss. We are now ready to state
the main Theorem:

Theorem 2 (Main Theorem). The uncertainty about the input
value $x$ after observing the output $y$ is identical to the
uncertainty about the interval $\mathcal{X}_i$ from which the input was
taken, i.e.,

$$ H(X|Y) = H(W|Y). \tag{17} $$



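Proof: Permitting continuous observations $Y$ of a discrete
random variable $W$, i.e., $\mathcal{Y} \subseteq \mathbb{R}$, we can write for the
conditional entropy [11], [18]

$$ H(W|Y) = \int_{\mathcal{Y}} H(W|Y = y) f_Y(y)\, dy \tag{18} $$

$$ \phantom{H(W|Y)} = -\int_{\mathcal{Y}} \sum_{i=1}^{L} p(w_i|y) \log p(w_i|y)\, f_Y(y)\, dy. \tag{19} $$

The conditional probability of $W$ given $Y$, $p(w_i|y) = \Pr(W = w_i|Y = y)$,
can be calculated from (16) as

$$ p(w_i|y) = \int_{\mathcal{X}_i} f_{X|Y}(x, y)\, dx \tag{20} $$

$$ \phantom{p(w_i|y)} = \frac{1}{f_Y(y)} \int_{\mathcal{X}_i} f_{Y|X}(x, y) f_X(x)\, dx \tag{21} $$

$$ \phantom{p(w_i|y)} = \frac{1}{f_Y(y)} \sum_{k \in \mathbb{I}(y)} \int_{\mathcal{X}_i} \frac{\delta(x - x_k)}{|g'(x_k)|} f_X(x)\, dx \tag{22} $$

where we made use of (5) and exchanged the order of summation
and integration. Due to piecewise strict monotonicity of
the function $g(\cdot)$, each interval contains at most one element
of the preimage of $y$. Thus, if $i \in \mathbb{I}(y)$, this element is given
by $g_i^{-1}(y)$ and

$$ p(w_i|y) = \frac{1}{f_Y(y)} \int_{\mathcal{X}_i} \frac{\delta(x - g_i^{-1}(y))}{|g'(g_i^{-1}(y))|} f_X(x)\, dx \tag{23} $$

$$ \phantom{p(w_i|y)} = \frac{f_X(g_i^{-1}(y))}{|g'(g_i^{-1}(y))|\, f_Y(y)}. \tag{24} $$

Conversely, if $i \notin \mathbb{I}(y)$, we obtain $p(w_i|y) = 0$.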

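We are now ready to compute the conditional entropy of $W$
given $Y$, i.e., the uncertainty about the interval from which $x$
was drawn:

$$ H(W|Y) = -\int_{\mathcal{Y}} \sum_{i=1}^{L} p(w_i|y) \log p(w_i|y)\, f_Y(y)\, dy \tag{25} $$

$$ \phantom{H(W|Y)} = \int_{\mathcal{Y}} \sum_{i \in \mathbb{I}(y)} \frac{f_X(x_i)}{|g'(x_i)|} \log\left( \frac{f_Y(y)\, |g'(x_i)|}{f_X(x_i)} \right) dy \tag{26} $$

where we used $x_i = g_i^{-1}(y)$. We can now substitute $x =
g_i^{-1}(y)$ for all $i = 1, \dots, L$, where in each $\mathcal{X}_i$ only a single
root remains. With $g_i(x) = g(x)$ on $\mathcal{X}_i$ we thus obtain

$$ H(W|Y) = \sum_{i=1}^{L} \int_{\mathcal{X}_i} f_X(x) \log\left( \frac{f_Y(g(x))\, |g'(x)|}{f_X(x)} \right) dx = \int_{\mathcal{X}} f_X(x) \log\left( \frac{\sum_{k \in \mathbb{I}(g(x))} \frac{f_X(x_k)}{|g'(x_k)|}}{\frac{f_X(x)}{|g'(x)|}} \right) dx \tag{27} $$

where we replaced $f_Y(g(x))$ by (6) and collapsed the sum of
integrals over $\mathcal{X}_i$ to a single integral over $\mathcal{X}$. Since this result
is identical to (9), i.e.,

$$ H(W|Y) = H(X|Y) \tag{28} $$

the proof is complete. ■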
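
Theorem 2 can be checked numerically by computing the interval posterior (24) and averaging its entropy. The sketch below is an illustration under stated assumptions, not part of the original derivation: it uses the function of Example 2 with $a = 1$ (i.e., $X$ uniform on $[-1, 1]$), and the helper names `interval_posterior`, `entropy_bits`, and `h_w_given_y` are hypothetical. Following Lemma 1, the expectation over $Y$ is evaluated as an integral over $x$.

```python
import math

def g(x):
    """Nonlinearity of Example 2: x^2 on the negative axis, x otherwise."""
    return x * x if x < 0 else x

def interval_posterior(y):
    """(p(w_1|y), p(w_2|y)) from (24) for X uniform on [-1, 1], 0 < y <= 1."""
    t1 = 0.5 / (2.0 * math.sqrt(y))   # f_X(x_1)/|g'(x_1)| at the root x_1 = -sqrt(y)
    t2 = 0.5                          # f_X(x_2)/|g'(x_2)| at the root x_2 = y
    f_y = t1 + t2                     # marginal PDF of Y, cf. (6)
    return t1 / f_y, t2 / f_y

def entropy_bits(probabilities):
    """Entropy of a discrete distribution in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)

def h_w_given_y(n=100_000):
    """Midpoint rule for E{H(W|Y = g(X))}, integrated over x as in Lemma 1."""
    h = 2.0 / n
    total = 0.0
    for k in range(n):
        x = -1.0 + (k + 0.5) * h      # never exactly 0 for even n
        total += entropy_bits(interval_posterior(g(x))) * 0.5 * h
    return total

result = h_w_given_y()
print(f"H(W|Y) is approximately {result:.4f} bit")
```

The value obtained this way should agree with the information loss of roughly 0.922 bit reported for $a = 1$ in Example 2, in line with $H(W|Y) = H(X|Y)$.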
The information loss induced by a function satisfying Definition 1
is thus only related to the roots of the equation
$y = g(x)$. Conversely, if the interval $\mathcal{X}_i$ of $x$ is known, the
exact value of $x$ can be reconstructed after observing $y$:

$$ H(X|Y) = H(X, W|Y) \tag{29} $$

$$ \phantom{H(X|Y)} = H(X|W, Y) + H(W|Y) \tag{30} $$

$$ \phantom{H(X|Y)} = H(X|W, Y) + H(X|Y) \tag{31} $$

and thus $H(X|W, Y) = 0$.

Fig. 2. Cascade of systems

Aside from the properties of conditional entropies (non-negativity
[8, pp. 15], asymmetry in its arguments, etc.) the
information loss has an important property concerning the
cascade of deterministic, static systems, which is not shared
by the conditional entropy in general. For such a cascade (see
Fig. 2), which in the static case is equivalent to a composition
of the implemented functions, we can prove the following
Theorem:

Theorem 3. Consider two functions $g\colon \mathcal{X} \to \mathcal{Y}$ and $h\colon \mathcal{Y} \to \mathcal{Z}$
satisfying Definition 1 and a cascade of systems implementing
these functions, as shown in Fig. 2. Let $Y = g(X)$ and $Z =
h(Y)$. Then, the information loss induced by this cascade, or
equivalently, by the composition $(h \circ g)(\cdot) = h(g(\cdot))$ is given
by:

$$ H(X|Z) = H(X|Y) + H(Y|Z). \tag{32} $$

Proof: The proof starts by expanding $H(X, Y|Z)$:

$$ \begin{aligned} H(X, Y|Z) &= H(X|Y, Z) + H(Y|Z) \\ &= H(X|Y) + H(Y|Z) \end{aligned} $$

since $X \to Y \to Z$ forms a Markov chain and thus $X$ and $Z$
are independent given $Y$ [8]. Further, $H(X, Y|Z) = H(X|Z)$
since $Y$ is a function of $X$. Thus,

$$ H(X|Z) = H(X|Y) + H(Y|Z) \tag{33} $$

and the proof is complete. ■
Extending Theorem 3, we obtain the following Corollary:

Corollary 1. Consider a set of functions $g_i\colon \mathcal{X}_{i-1} \to \mathcal{X}_i$,
$i = 1, \dots, N$, each satisfying Definition 1, and a cascade of
systems implementing these functions. Let $X_i$, $i = 1, 2, \dots, N$,
denote the output of the $i$th constituent system and, thus, the
input of the $(i+1)$th system. Given the input of the first system,
$X_0$, we have

$$ H(X_0|X_N) = \sum_{i=1}^{N} H(X_{i-1}|X_i). \tag{34} $$

Proof: The Corollary is proved by repeated application
of Theorem 3. ■
This result does not imply that the order in which the
functions are arranged has no influence on the information
loss of the cascade, as one would expect from stable, linear
systems. Illustrative examples showing that the order does matter
can be found, e.g., in [19]. Moreover, calculating the individual
information losses requires in each case the PDF of the input
to the function under consideration. While this does not seem
to yield an improvement compared to a direct evaluation of (9),
Theorem 3 can be used to bound the information loss of the
cascade efficiently whenever bounds on the individual information
losses are available. We will introduce such bounds in
the next Section.

Equation (13), referenced in the proof of Theorem 1:

$$ H(X|Y) = \lim_{\hat{X} \to X} \left( I(\hat{X};X) - I(\hat{X};Y) \right) = \int_{\mathcal{X}} \int_{\mathcal{X}} f_X(x)\, \delta(\hat{x} - x) \log\left( \frac{\delta(\hat{x} - x)\, f_Y(g(x))}{f_X(x) \sum_{k \in \mathbb{I}(g(x))} \frac{\delta(\hat{x} - x_k)}{|g'(x_k)|}} \right) d\hat{x}\, dx \tag{13} $$

IV. Upper Bounds on the Information Loss 

In many situations it might be inconvenient, or even impossible,
to evaluate the information loss (9) analytically since it
involves the logarithm of a sum, for which only inequalities
exist [8]. Therefore, one has to resort to numerical integration
or use bounds on the information loss which are simpler to
evaluate. In this Section we derive an upper bound which
requires only minor knowledge about the function $g(\cdot)$,
namely the number of intervals $L$, and we show that this
bound is tight.

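Theorem 4. The information loss induced by a function $g(\cdot)$
satisfying Definition 1 can be upper bounded by the following
ordered set of inequalities:

$$ H(X|Y) \le \int_{\mathcal{Y}} f_Y(y) \log\left( |\mathbb{I}(y)| \right) dy \tag{35} $$

$$ \phantom{H(X|Y)} \le \log\left( \sum_{l=1}^{L} \int_{\mathcal{Y}_l} f_Y(y)\, dy \right) \tag{36} $$

$$ \phantom{H(X|Y)} \le \log L. \tag{37} $$

Bound (35) holds with equality if and only if

$$ \sum_{k \in \mathbb{I}(g(x))} \frac{f_X(x_k)}{|g'(x_k)|} \frac{|g'(x)|}{f_X(x)} \tag{38} $$

is piecewise constant. If this expression is constant for all
$x \in \mathcal{X}$, bound (36) is tight. Bound (37) holds with equality
if and only if additionally $\mathcal{Y}_l = \mathcal{Y}$ for all $l = 1, \dots, L$, and
thus (38) evaluates to $L$.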



Proof: The proof relies upon Theorem 2, where we
established $H(X|Y) = H(W|Y)$ and thus

$$ H(X|Y) = \int_{\mathcal{Y}} H(W|Y = y) f_Y(y)\, dy. \tag{39} $$

The first inequality, based upon the maximum entropy property
of the uniform distribution [8, pp. 29], becomes an equality if
$p(w_i|y) = 1/|\mathbb{I}(y)|$ for all $i \in \mathbb{I}(y)$, such that $H(W|Y = y) =
\log(|\mathbb{I}(y)|)$. Combining this with (24) we have

$$ p(w_i|y) = \frac{f_X(g_i^{-1}(y))}{|g'(g_i^{-1}(y))|\, f_Y(y)} = \frac{1}{|\mathbb{I}(y)|}. \tag{40} $$

Performing multiplicative inversion and inserting (6) we obtain

$$ |\mathbb{I}(y)| = f_Y(y) \frac{|g'(g_i^{-1}(y))|}{f_X(g_i^{-1}(y))} \tag{41} $$

$$ \phantom{|\mathbb{I}(y)|} = \sum_{k \in \mathbb{I}(y)} \frac{f_X(x_k)}{|g'(x_k)|} \frac{|g'(g_i^{-1}(y))|}{f_X(g_i^{-1}(y))} \tag{42} $$

for all $y \in \mathcal{Y}$ and all $i \in \mathbb{I}(y) \subseteq \{1, \dots, L\}$. Since generally
$|\mathbb{I}(y)|$ is piecewise constant and independent of $i$ as long as
$i \in \mathbb{I}(y)$, it is immaterial which $i$ from $\mathbb{I}(y)$ is chosen. Thus,
one can exchange $g_i^{-1}(y)$ by $x$ and $\mathbb{I}(y)$ by $\mathbb{I}(g(x))$, which
proves the requirement of the first bound.

The second inequality is due to Jensen [8, pp. 27]. Equality
is achieved if $|\mathbb{I}(y)|$ is constant for all $y \in \mathcal{Y}$, or equivalently

$$ \sum_{k \in \mathbb{I}(g(x))} \frac{f_X(x_k)}{|g'(x_k)|} \frac{|g'(x)|}{f_X(x)} = \text{const.} \tag{43} $$

for all $x \in \mathcal{X}$.

If the requirements for equality in (35) and (36) are met,
the information loss is given as $H(X|Y) = \log(|\mathbb{I}(y)|)$ for
any $y \in \mathcal{Y}$. Thus, the last bound, (37), is tight if and only if
$|\mathbb{I}(y)| = L$ for all $y \in \mathcal{Y}$. This requires that $\mathcal{Y}_l = \mathcal{Y}$ for all
$l = 1, \dots, L$ and completes the proof. ■

An example of a function $g(\cdot)$ for which (38) is piecewise
constant assumes on each interval $\mathcal{X}_l$ the cumulative distribution
function $F_X(x)$, possibly modified with a sign and an
additive constant. In other words, for all $l = 1, \dots, L$

$$ g_l(x) = b_l F_X(x) + c_l \tag{44} $$

where $b_l \in \{1, -1\}$ and $c_l \in \mathbb{R}$ are arbitrary constants. Such a
function $h\colon \mathcal{X} \to \mathcal{Y}$ is depicted in Fig. 3. The constants $c_l$ and
the probability masses in each interval are constrained if (38)
shall be constant. As a special case, consider this constant
to be equal to $L$, which guarantees tightness in the largest
bound (37). In order that appropriate constants $c_l$ exist, all
intervals $\mathcal{X}_l$ have to contain the same probability mass, i.e.,

$$ \int_{\mathcal{X}_l} f_X(x)\, dx = \frac{1}{L}. \tag{45} $$

Since equal probability mass in all intervals is a necessary, but
not a sufficient condition for equality in (37) (cf. Fig. 3), the
constants $c_l$ have to be set to

$$ c_l = -\sum_{i=1}^{l-1} \int_{\mathcal{X}_i} f_X(x)\, dx = -\frac{l-1}{L} \tag{46} $$

where we assume that the intervals are ordered and where $b_l =
1$ for all $l$. A function $g\colon \mathcal{X} \to \mathcal{Z}$ satisfying these requirements
is shown in Fig. 3.

Another example of a function satisfying the tightness
conditions of Theorem 4 is given in Example 1 of Section V.

V. Examples 

In this Section, the application of the obtained expression 
for the information loss and its upper bounds is illustrated. 
Unless otherwise noted, the logarithm is taken to base 2. 

A. Example 1: Even PDF, Magnitude Function 

Consider a continuous RV $X$ with an even PDF, i.e.,
$f_X(-x) = f_X(x)$. Let the support be $\mathcal{X} = \mathbb{R}$ and let this RV
be the input to the magnitude function, i.e.,

$$ g(x) = |x| = \begin{cases} -x, & \text{if } x < 0 \\ x, & \text{if } x \ge 0. \end{cases} \tag{47} $$

Fig. 3. Piecewise strictly monotone functions with $L = 3$ satisfying conditions of Theorem 4. The plot shows $F_X(x)$, $g(x)$, and $h(x)$. The function in blue, $h\colon \mathcal{X} \to \mathcal{Y}$, renders (38) piecewise
constant but not constant due to improper setting of the constants $c_l$. Tightness is achieved in the smallest bound, (35), only. The function in red, $g\colon \mathcal{X} \to \mathcal{Z}$,
satisfies all conditions (i.e., (38) is constant and $\mathcal{Z}_l = \mathcal{Z}$ for all $l$) and thus achieves equality in the largest bound (37). Note that the subdomains $\mathcal{X}_l$ are
chosen such that each subdomain contains the same probability mass.



The magnitude function is piecewise strictly monotone on
$\mathcal{X}_1 = (-\infty, 0)$ and $\mathcal{X}_2 = [0, \infty)$, and with $L = 2$ we obtain
the largest bound from Theorem 4 as

$$ H(X|Y) \le \log 2 = 1. \tag{48} $$

Both intervals are mapped to the positive (non-negative) real
axis, i.e., $\mathcal{Y}_1 \cup \{0\} = \mathcal{Y}_2 = \mathcal{Y} = [0, \infty)$, which implies that
the second bound in Theorem 4 also yields $H(X|Y) \le 1$. The
magnitude of the first derivative of $g(\cdot)$ is equal to unity on
both $\mathcal{X}_1$ and $\mathcal{X}_2$. There are two partial inverses mapping $\mathcal{Y}$ to
the subdomains of $\mathcal{X}$:

$$ x_1 = g_1^{-1}(y) = -y = -g(x), \quad \text{and} \tag{49} $$

$$ x_2 = g_2^{-1}(y) = y = g(x). \tag{50} $$

Thus for all $x \in \mathcal{X}$ we have $|\mathbb{I}(g(x))| = 2$, which renders
the smallest bound of Theorem 4 as $H(X|Y) \le 1$. Combining
(47) with the two partial inverses, we obtain for $x \in \mathcal{X}_1$:

$$ x_1 = x, \quad \text{and} \tag{51} $$

$$ x_2 = -x. \tag{52} $$

Conversely, for $x \in \mathcal{X}_2$ we have $x_1 = -x$ and $x_2 = x$. Using
this in (9) we obtain the information loss

$$ \begin{aligned} H(X|Y) &= \int_{\mathcal{X}_1} f_X(x) \log\left( \frac{f_X(x) + f_X(-x)}{f_X(x)} \right) dx + \int_{\mathcal{X}_2} f_X(x) \log\left( \frac{f_X(-x) + f_X(x)}{f_X(x)} \right) dx \\ &= \log 2 \int_{\mathcal{X}} f_X(x)\, dx = 1 \end{aligned} $$

which shows that all bounds of Theorem 4 are tight in this
example.

The conditional entropy is identical to one bit. In other
words, if an RV with an even PDF (thus, with equal probability
masses for positive and negative values) is fed through a
magnitude function, one bit of information is lost. Although
this result seems obvious, to the best knowledge of the authors
this is the first time that it has been derived for a continuous
input.




Fig. 4. Piecewise strictly monotone function of Example 2



B. Example 2: Zero-Mean Uniform PDF, Piecewise Strictly 
Monotone Function 

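Consider an RV $X$ uniformly distributed on $[-a, a]$, $a \ge 1$,
and a function $g(\cdot)$ defined as:

$$ g(x) = \begin{cases} x^2, & \text{if } x < 0 \\ x, & \text{if } x \ge 0. \end{cases} \tag{53} $$

This function, depicted in Fig. 4, is piecewise strictly monotone
on $(-\infty, 0)$ and $[0, \infty)$. We introduce the following
partitioning:

$$ \mathcal{X}_1 = [-a, 0) \to \mathcal{Y}_1 = (0, a^2] \tag{54} $$

$$ \mathcal{X}_2 = [0, a] \to \mathcal{Y}_2 = [0, a] \tag{55} $$

Since the function is not differentiable, we define $|g'(\cdot)|$ in a
piecewise manner by the magnitude of the first derivatives of
$g_l(\cdot)$:

$$ |g'(x)| = \begin{cases} 2|x|, & \text{if } x < 0 \\ 1, & \text{if } x \ge 0. \end{cases} \tag{56} $$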

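The two partial inverses on $\mathcal{Y}_2$ are given by

$$ x_1 = -\sqrt{y} = -\sqrt{g(x)} = \begin{cases} x, & \text{if } x < 0 \\ -\sqrt{x}, & \text{if } x \ge 0 \end{cases} \quad \text{and} \tag{57} $$

$$ x_2 = y = g(x) = \begin{cases} x^2, & \text{if } x < 0 \\ x, & \text{if } x \ge 0 \end{cases} \tag{58} $$

while on $\mathcal{Y}_1 \setminus \mathcal{Y}_2 = (a, a^2]$ only the root $x_1$ exists (i.e., this
image is mapped to bijectively). Noticing that the information
loss on the corresponding preimage $\mathcal{X}_b = [-a, -\sqrt{a})$ is zero,
we can write for the conditional entropy:

$$ H(X|Y) = \int_{\mathcal{X}_1 \setminus \mathcal{X}_b} f_X(x) \log\left( \frac{f_X(x) + 2|x|\, f_X(x^2)}{f_X(x)} \right) dx + \int_{\mathcal{X}_2} f_X(x) \log\left( \frac{f_X(x) + \frac{f_X(-\sqrt{x})}{2\sqrt{x}}}{f_X(x)} \right) dx \tag{59} $$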

Since $f_X(x) = \frac{1}{2a}$ for all $x$, $x^2$, and $-\sqrt{x}$ in the designated
integration ranges, we obtain

$$ \begin{aligned} H(X|Y) &= \frac{1}{2a} \int_{-\sqrt{a}}^{0} \log(1 + 2|x|)\, dx + \frac{1}{2a} \int_{0}^{a} \log\left( 1 + \frac{1}{2\sqrt{x}} \right) dx \\ &= \frac{4a + 4\sqrt{a} + 1}{8a} \log(2\sqrt{a} + 1) - \frac{\log(2\sqrt{a})}{2} - \frac{1}{4\sqrt{a} \ln 2} \end{aligned} $$

where $\ln$ is the natural logarithm. For $a = 1$ this evaluates
to $H(X|Y) \approx 0.922$ bits. The information loss is slightly
less than one bit, despite the fact that two equal probability
masses collapse and the complete sign information is lost.
This suggests that by observing the output, part of the sign
information can be retrieved. Looking at Fig. 4, one can see
that for the subdomain located on the negative real axis, i.e.,
for $\mathcal{X}_1$, more probability mass is mapped to smaller outputs
than to higher outputs. Thus, for a small output value $y$ it
is more likely that the input originated from $\mathcal{X}_1$ than from
$\mathcal{X}_2$ (and vice versa for large output values). Mathematically,
this means that despite $p(w_1) = p(w_2) = 0.5$, we have
$p(w_1|y) \neq p(w_2|y)$, which according to Theorem 2 plays a
central role in computing the conditional entropy $H(X|Y)$.
By evaluating the bounds from Theorem 4 we obtain

$$ H(X|Y) \le \frac{1 + \sqrt{a}}{2\sqrt{a}} \le \log\left( \frac{1 + 3\sqrt{a}}{2\sqrt{a}} \right) \le 1 \tag{60} $$

which for $a = 1$ all evaluate to 1 bit. The bounds are not tight
as the conditions of Theorem 4 are not met in this case.
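
The numbers of this example can be reproduced with a few lines of code. The following sketch is an illustration under stated assumptions rather than the authors' evaluation: `loss_closed_form` implements the closed-form expression given above, `loss_numeric` integrates the two logarithmic terms with a midpoint rule, and `bounds` evaluates the three bounds of Theorem 4 from the probability that an output has two preimages (the second bound is computed via Jensen's inequality as the logarithm of the expected number of roots). All function names are hypothetical, and the code assumes $a \ge 1$.

```python
import math

def loss_closed_form(a):
    """Closed-form information loss of Example 2 in bits (assumes a >= 1)."""
    s = math.sqrt(a)
    return ((4 * a + 4 * s + 1) / (8 * a) * math.log2(2 * s + 1)
            - 0.5 * math.log2(2 * s)
            - 1.0 / (4 * s * math.log(2)))

def loss_numeric(a, n=100_000):
    """Midpoint-rule evaluation of the two integrals for a uniform input."""
    s = math.sqrt(a)
    h1, h2 = s / n, a / n
    # Integral of log2(1 + 2|x|) over [-sqrt(a), 0), mirrored to (0, sqrt(a)]
    part1 = sum(math.log2(1 + 2 * (k + 0.5) * h1) for k in range(n)) * h1
    # Integral of log2(1 + 1/(2 sqrt(x))) over (0, a]
    part2 = sum(math.log2(1 + 0.5 / math.sqrt((k + 0.5) * h2)) for k in range(n)) * h2
    return (part1 + part2) / (2 * a)

def bounds(a):
    """The three upper bounds of Theorem 4 for Example 2, in bits."""
    s = math.sqrt(a)
    p_two_roots = (1 + s) / (2 * s)        # Pr(Y in [0, a]), where |I(y)| = 2
    first = p_two_roots * math.log2(2)     # expected log2 of the number of roots
    second = math.log2(1 + p_two_roots)    # log2 of the expected number of roots
    third = math.log2(2)                   # log2 of the number of subdomains
    return first, second, third

for a in (1.0, 4.0):
    print(a, loss_closed_form(a), loss_numeric(a), bounds(a))
```

For $a = 1$ all three bounds coincide at one bit while the loss itself stays below it, which matches the discussion above; for larger $a$ the bounds separate.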

C. Example 3: Normal PDF, Third-order Polynomial 

Finally, consider a Gaussian RV $X \sim \mathcal{N}(0, \sigma^2)$ and the
function

$$ g(x) = x^3 - 100x \tag{61} $$

depicted in Fig. 5. An analytic computation of the information
loss is prevented by the logarithm of a sum in (9). Still, we
will show that with the help of Theorem 4 at least a bound
on the information loss can be computed.

Judging from the extrema of this function, three intervals of
strict monotonicity can be defined. Further, the domain which
is mapped bijectively can be shown to be identical to
$\mathcal{X}_b = \left( -\infty, -\frac{20}{\sqrt{3}} \right] \cup \left[ \frac{20}{\sqrt{3}}, \infty \right)$ and contains a probability mass of

$$ P_b = 2\, Q\left( \frac{20}{\sqrt{3}\, \sigma} \right) \tag{62} $$

where $Q(\cdot)$ is the Q-function. With this result and the fact that

$$ P_b = \int_{\mathcal{X}_b} f_X(x)\, dx = \int_{\mathcal{Y}_b} f_Y(y)\, dy \tag{63} $$

for a bijective mapping $g(\cdot)$ between $\mathcal{X}_b$ and $\mathcal{Y}_b$, we can upper
bound the information loss by Theorem 4:

$$ H(X|Y) \le \int_{\mathcal{Y}} f_Y(y) \log(|\mathbb{I}(y)|)\, dy \tag{64} $$

$$ \phantom{H(X|Y)} = \int_{\mathcal{Y} \setminus \mathcal{Y}_b} f_Y(y) \log 3\, dy = (1 - P_b) \log 3 \tag{65} $$

since $|\mathbb{I}(y)| = 1$ if $y \in \mathcal{Y}_b$ and $|\mathbb{I}(y)| = 3$ if $y \in \mathcal{Y} \setminus \mathcal{Y}_b$. In
Fig. 6 this bound is illustrated together with the results from
numerical integration of (9) and from Monte-Carlo simulations
of the information loss.

Fig. 5. Third-order polynomial of Example 3

Fig. 6. Information loss of Example 3
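
The bound of this example and a simulation-based estimate of the loss can be sketched as follows. This is an illustration under stated assumptions and does not reproduce the authors' simulation setup: `bound` evaluates $(1 - P_b) \log 3$ with the Q-function expressed through `math.erfc`, `three_roots` solves $x^3 - 100x = y$ with the standard trigonometric formula for a cubic with three real roots, and `mc_loss` averages $-\log_2 p(w_i|y)$ over Gaussian samples, using Theorem 2 and (24). All names, the sample size, and the seed are hypothetical choices.

```python
import math
import random

Y_MAX = 2000.0 / (3.0 * math.sqrt(3.0))   # |g(x)| at the local extrema x = -/+ 10/sqrt(3)
X_B = 20.0 / math.sqrt(3.0)               # inputs with |x| >= X_B are mapped bijectively

def g(x):
    return x ** 3 - 100.0 * x

def abs_dg(x):
    return abs(3.0 * x * x - 100.0)

def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def bound(sigma):
    """Upper bound (1 - P_b) * log2(3) on the information loss, in bits."""
    p_b = 2.0 * q_function(X_B / sigma)
    return (1.0 - p_b) * math.log2(3.0)

def three_roots(y):
    """Real roots of x^3 - 100 x = y for |y| <= Y_MAX (trigonometric formula)."""
    c = max(-1.0, min(1.0, y / Y_MAX))     # clamp against rounding errors
    phi = math.acos(c) / 3.0
    r = 2.0 * math.sqrt(100.0 / 3.0)
    return [r * math.cos(phi - 2.0 * math.pi * k / 3.0) for k in range(3)]

def mc_loss(sigma, n, seed=0):
    """Monte-Carlo estimate of H(X|Y) in bits as the mean of -log2 p(w_i|y)."""
    rng = random.Random(seed)
    # Unnormalized f_X(x)/|g'(x)|; the Gaussian normalization cancels in the ratio
    weight = lambda x: math.exp(-0.5 * (x / sigma) ** 2) / abs_dg(x)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, sigma)
        if abs(x) >= X_B:
            continue                       # single root, no information lost
        numerator = sum(weight(xk) for xk in three_roots(g(x)))
        total += math.log2(numerator / weight(x))
    return total / n

sigma = 10.0
print(f"bound: {bound(sigma):.4f} bit, estimate: {mc_loss(sigma, 20_000):.4f} bit")
```

For small $\sigma$ almost all probability mass lies in the region with three preimages and the bound approaches $\log_2 3$; for large $\sigma$ most of the mass is mapped bijectively and the bound tends to zero.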

VI. Conclusion 

We presented an analytic expression for the information 
loss induced by a static nonlinearity. It was shown that this 
information loss is strongly related to the non-injectivity of the 
system, i.e., to the fact that a particular output can result from 
multiple possible inputs. Conversely, given a certain output, 
the input to the system under consideration is uncertain only 
with respect to the roots of the equation describing the system. 
The information loss can be utilized, e.g., for estimating the 
reconstruction error for nonlinearly distorted signals. 



Since the obtained expression involves the integral over the 
logarithm of a sum, bounds on the information loss were 
derived which can be computed with much less difficulty. 
In particular, it was shown that the information loss for a 
piecewise strictly monotone function is upper bounded by the 
logarithm of the number of subdomains, and that this bound 
is tight. 

Generalizations of these results to rates of information loss 
and nonlinear systems with memory, as well as the extension 
to discrete random variables are the object of future work. 
Appendix

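We have yet to show that in the proof of Theorem 1 limit
and integration can be exchanged. Since both $I(\hat{X};X)$ and
$I(\hat{X};Y)$ are an expectation over the same joint PDF (cf. (11)
and (12)), we can express their difference as a single integral.
Splitting the logarithm of a product into a sum of logarithms,
we obtain

$$ \begin{aligned} H(X|Y) &= \lim_{\hat{X} \to X} \left( I(\hat{X};X) - I(\hat{X};Y) \right) \\ &= \lim_{\hat{X} \to X} \left[ \iint f_{\hat{X}X}(\hat{x}, x) \log\left( \frac{f_Y(g(x))}{f_X(x)} \right) dx\, d\hat{x} + \iint f_{\hat{X}X}(\hat{x}, x) \log\left( \frac{f_{X|\hat{X}}(\hat{x}, x)}{f_{Y|\hat{X}}(\hat{x}, g(x))} \right) dx\, d\hat{x} \right]. \end{aligned} $$

Note that the integration ranges have been omitted due to space
limitations. The first double integral can be reduced to a single
integral and taken out of the limit, since the logarithm does
not depend on $\hat{x}$. For the second integral we first invoke the
method of transformation (e.g., [2]) to obtain

$$ f_{Y|\hat{X}}(\hat{x}, g(x)) = \frac{f_{X|\hat{X}}(\hat{x}, x)}{|g'(x)|} \sum_{i \in \mathbb{I}(g(x))} \frac{|g'(x)|\, f_{X|\hat{X}}(\hat{x}, x_i)}{|g'(x_i)|\, f_{X|\hat{X}}(\hat{x}, x)}. $$

Using this result in the integral above and splitting again the
logarithm, one obtains

$$ H(X|Y) = \int f_X(x) \log\left( \frac{f_Y(g(x))\, |g'(x)|}{f_X(x)} \right) dx - \lim_{\hat{X} \to X} \iint f_{\hat{X}X}(\hat{x}, x) \log\left( \sum_{i \in \mathbb{I}(g(x))} \frac{|g'(x)|\, f_{X|\hat{X}}(\hat{x}, x_i)}{|g'(x_i)|\, f_{X|\hat{X}}(\hat{x}, x)} \right) dx\, d\hat{x}. \tag{66} $$

Note that one element in the sum
in the logarithm is identical to one, since the preimage of $g(x)$
contains $x$. All other elements of the sum are non-negative, so
the integral is taken over a non-negative function.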

Now let $\hat{X}$ approach $X$ in a special way: As depicted
in Fig. 1, $\hat{X}$ is a (non-uniformly) quantized version of $X$.
We perform the limit as a sequence of refinements of the
partitioning, such that eventually the step sizes approach zero.
$\hat{X}$ can thus be viewed as the sum of $X$ and a signal-dependent
noise term $N$ with support $[-\Delta_{\hat{x}}/2, \Delta_{\hat{x}}/2]$. The subscript $\hat{x}$ indicates the
dependence of the quantization step size on the quantization
bin, i.e., the non-uniformity of the quantizer. Let $Q(X)$ denote
the discrete values $\hat{X}$ can assume. We then obtain for the
conditional PDF of $X$ given $\hat{X}$

$$ f_{X|\hat{X}}(\hat{x}, x) = \frac{f_X(x)\, \chi\{Q(x) = \hat{x}\}}{F_X\left( Q(x) + \frac{\Delta_{\hat{x}}}{2} \right) - F_X\left( Q(x) - \frac{\Delta_{\hat{x}}}{2} \right)} \tag{67} $$

where $\chi\{\cdot\}$ is the indicator function, assuming one whenever
the condition in the argument is fulfilled and zero otherwise.

The refinement of the partitioning suggests a converging
sequence of functions as the argument of the integral in (66).
At some point in the sequence we can assume that, for all
$x$, the partitioning is so fine that no two elements of the
preimage of $g(x)$ lie in the same quantization bin. This is
trivially fulfilled if the end points of all intervals $\mathcal{X}_l$ (cf.
Definition 1) are quantization thresholds. In this case the sum
in the logarithm in (66) is identical to one, which renders
the argument of the integral zero. Since any refinement of
a partitioning does not change existing, but only adds new
quantization thresholds, the integral remains zero.

While this analysis yields yet another proof for Theorem 1,
it also allows us to upper bound the argument of the second
integral in (66) by the zero function. Thus, invoking Lebesgue's
dominated convergence theorem allows us to exchange limit
and integration.